
Support distributed evaluation #176

Open · wants to merge 6 commits into base: main

Conversation

@mehdidc (Contributor) commented Sep 24, 2022

Currently, evaluation is done on rank zero. This PR provides support for distributed evaluation (using an optional --distributed-evaluation argument) to make evaluation faster (supports both zero-shot and retrieval).
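For context, the rough flow is: every rank runs the eval loop over its own 1/world_size shard of the validation set and encodes it, and the per-rank features are combined before the zero-shot/retrieval metrics are computed. Below is a minimal sketch of the per-rank half; names are illustrative rather than the PR's code verbatim, and the model is assumed to return image features, text features, and the logit scale as open_clip did at the time.

import torch


@torch.no_grad()
def encode_eval_shard(model, dataloader, device):
    # Each rank iterates only over its own shard of the validation set
    # and encodes it; shards are combined later, before metrics.
    all_image_features, all_text_features = [], []
    for images, texts in dataloader:
        images, texts = images.to(device), texts.to(device)
        image_features, text_features, logit_scale = model(images, texts)
        all_image_features.append(image_features)
        all_text_features.append(text_features)
    return torch.cat(all_image_features), torch.cat(all_text_features)

The gathering step itself is sketched further down in the thread, in the discussion of get_metrics.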

@rom1504 (Collaborator) commented Nov 7, 2022

this LGTM

could you please double check, @mitchellnw or @rwightman?

@mitchellnw (Contributor) commented:

In the pytorch imagenet example for distributed imagenet eval they have an aux_val_loader to handle the case where the test set size is not divisible by num_gpus - do we need to have this? https://github.com/pytorch/examples/blob/main/imagenet/main.py#L396-L402
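Roughly, the pattern in the linked example: the main val loader uses a DistributedSampler with drop_last=True so every rank gets an equally sized shard, and the leftover tail of the dataset is evaluated separately so nothing is skipped. A paraphrased sketch with toy stand-ins (names and sizes are approximate, not copied from the example):

import torch
from torch.utils.data import DataLoader, DistributedSampler, Subset, TensorDataset

# Toy stand-ins; in real code these come from the script / torch.distributed.
val_dataset = TensorDataset(torch.arange(1003).float())
world_size, rank, batch_size = 4, 0, 64

# Equal-sized shard per rank; the remainder is dropped by the sampler.
val_sampler = DistributedSampler(
    val_dataset, num_replicas=world_size, rank=rank, shuffle=False, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, sampler=val_sampler)

# Whatever did not fit into the equal shards goes into an auxiliary loader,
# evaluated separately so no example is silently skipped.
covered = len(val_sampler) * world_size
if covered < len(val_dataset):
    aux_val_dataset = Subset(val_dataset, range(covered, len(val_dataset)))
    aux_val_loader = DataLoader(aux_val_dataset, batch_size=batch_size, shuffle=False)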

    assert wandb is not None, 'Please install wandb.'
    for name, val in metrics.items():
        wandb.log({f"val/{name}": val, 'epoch': epoch})

    return metrics


-def get_metrics(image_features, text_features, logit_scale):
+def get_metrics(image_features, text_features, logit_scale, args):
A reviewer commented:

nit pick but maybe better to pass a more specific argument e.g. gather_tensor=args.distributed_evaluation?

@mehdidc (Contributor Author) replied:

> nit pick but maybe better to pass a more specific argument e.g. gather_tensor=args.distributed_evaluation?

Thanks, I agree, I think it's better.
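Something along these lines, i.e. an explicit flag instead of passing the whole args object. This is a sketch under the assumption that the flag is named gather_tensors; it is not the final code in this PR.

import torch
import torch.distributed as dist


def get_metrics(image_features, text_features, logit_scale, gather_tensors=False):
    if gather_tensors:
        # Each rank holds only its shard of the eval features; gather them so
        # the image-text similarity matrix covers the full eval set everywhere.
        world_size = dist.get_world_size()
        image_shards = [torch.zeros_like(image_features) for _ in range(world_size)]
        text_shards = [torch.zeros_like(text_features) for _ in range(world_size)]
        dist.all_gather(image_shards, image_features)
        dist.all_gather(text_shards, text_features)
        image_features = torch.cat(image_shards)
        text_features = torch.cat(text_shards)
    # ... existing metric computation (logits, ranks, recall@k) continues here ...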

@@ -359,7 +361,8 @@ def get_wds_dataset(args, preprocess_img, is_train, epoch=0, floor=False):
        num_samples = num_batches * global_batch_size
        dataset = dataset.with_epoch(num_worker_batches)  # each worker is iterating over this
    else:
        # last batches are partial, eval is done on single (master) node
        if args.distributed_evaluation:
            num_samples = num_samples // args.world_size
A reviewer commented:

maybe I'm misunderstanding, but does it skip the last few samples due to the last partial batch?

@mehdidc (Contributor Author) replied on Nov 14, 2022:

> maybe I'm misunderstanding, but does it skip the last few samples due to the last partial batch?

In evaluation, num_samples is only used for logging:

samples_per_val = dataloader.num_samples
f"Eval Epoch: {epoch} [{num_samples} / {samples_per_val}]\t"

It does not affect the dataloader, which depends only on the wds pipeline. But it would be good to get it correct anyway; I have to check exactly how many examples each worker receives.

@mehdidc (Contributor Author) commented Nov 14, 2022

> In the pytorch imagenet example for distributed imagenet eval they have an aux_val_loader to handle the case where the test set size is not divisible by num_gpus - do we need to have this? https://github.com/pytorch/examples/blob/main/imagenet/main.py#L396-L402

Thanks for the link. I'm not sure why they use drop_last=True on val_loader (not used here); probably to avoid having one GPU worker with far fewer examples than the others? So rather than letting shards be uneven, they seem to drop the last partial shards and then compute validation performance on the remaining few examples in all GPU workers.

@mitchellnw (Contributor) commented:

@mehdidc I think this is actually necessary or else you can get different val perf when different numbers of gpus are used, e.g., see this comment: https://github.com/facebookresearch/deit/blob/main/main.py#L221-L223

@mehdidc (Contributor Author) commented Nov 15, 2022

> @mehdidc I think this is actually necessary or else you can get different val perf when different numbers of gpus are used, e.g., see this comment: https://github.com/facebookresearch/deit/blob/main/main.py#L221-L223

I see, thanks @mitchellnw! OK, so I need to fix this. I really thought that DistributedSampler with drop_last=False would do the "right" thing (although that wouldn't be ideal for the distributed setting if it were the case), in the sense of seeing each example exactly once even when the dataset size is not divisible by the number of workers, e.g. by distributing the remaining examples to a subset of the workers.
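As a toy illustration of why that is not what happens: with drop_last=False, DistributedSampler pads the index list by repeating samples until it divides evenly across ranks, so some examples are counted twice and the aggregated metrics end up depending on the number of GPUs. The snippet below passes num_replicas/rank explicitly, so no process group is needed to run it.

import collections

import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))  # 10 samples, 4 "GPUs"
indices = []
for rank in range(4):
    sampler = DistributedSampler(
        dataset, num_replicas=4, rank=rank, shuffle=False, drop_last=False)
    indices.extend(sampler)

counts = collections.Counter(indices)
print(sorted(counts.items()))
# Every rank gets 3 indices (12 total), so 2 of the 10 samples appear twice:
# those padded duplicates are what make the metrics depend on the world size.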

@dmlpt commented Feb 15, 2023

Is this argument "--distributed_evaluation" not available in the current version?

@mehdidc (Contributor Author) commented Feb 19, 2023

@dmlpt Not yet, I still need to fix the val dataloader as @mitchellnw mentions, and rebase on master.
